CPSC 545/445 (Autumn 2003) - Class 9: Phylogenetic Tree Construction Module 3, Part 3 --- recap (very briefly) - parsimony, particularly big parsimony problem = searching over tree topologies - simultaneous alignment and phylogenetic tree inference --- Maximum Likelihood Approach for Phylogenetic Tree Construction Key ideas: - use a probabilistic model of evaluation - for given sequences X, find a tree T with maximal P(X | T) Use similar approach as above, i.e., solve the following two problems: - for given sequences X and tree T, determine P(X | T) under given model of evolution - search over trees in order to find overall optimum Note: unlike above, in this section we will consider rooted trees How to determine P(X | T): - use Felsenstein's algorithm [1981], which iteratively computes for nodes of tree, sweeping tree from leaves to root (similar to parsimony algorithm) - detailes later How to search over trees in order to find overall optimum: - use same methods as for large parsimony problem (see above) (but have to additionally find optimal edge lengths for given topology) - can also use probabilistic sampling methods, e.g., Metropolis algorithm (see Durbin et al., p.207ff) Before we provide details on Felsentein's algorithm ... -- Probabilistic models of evoluation Assumptions: - each site evolves independently - no insertions and deletions - DNA sequences want: probabilities P(x|y,t) = prob that base y evolves into base x in time t consider substitution matrix S(t) = [P(a_i|a_j,t)]_ij assume: S(t+s) = S(t)*(s) (this is satisfied if the underlying probalistic process is Markovian and stationary) Jukes-Cantor Model [1969]: P(X|X,t) = 1/4 * (1 + 3*exp(-4*alpha t) P(X|Y,t) = 1/4 * (1 - exp(-4*alpha t) (derived from the assumption that rate of X->Y is equal for all X,Y with X!=Y; implies nucleotide equilibrium frequencies of 1/4 for all bases) More realistic models exist, e.g. Kimura model [1980] that allows different rates (and hence probabilities) for transitions (A<->G, C<->T) and transversions (all others) For protein sequences, a slightly generalised form (for arbitrary time t) of the PAM matrices can be used. -- How to determine P(X | T) - for two sequences: tree T with only two leaves x1,x2, two edges (length t1,t2): x1 <-t1- a -t2-> x2 P(x1,x2,a|T,t1,t2) = q_a * P(x1|a,t1) * P(x2|a,t2) q_a = equilibrium probability of a (from substitution matrix) P(x1|a,t1) = prob that a evolves into x1 in time t1 P(x1,x2|T,t1,t2) = Sum_a P(x1,x2,a|T,t1,t2) = Sum (q_a * P(x1|a,t1) * P(x2|a,t2)) (sum over all possibilities for residue a) With N sites, take product over each individual site (assumption: all sites evolve independently) [example? see Durbin et al., p.198f] -- How to determine P(X | T) - for n sequences: given: sequences x1...xn, tree T, edge lengths t label leaf nodes 1...n, non-leaf nodes n+1...2n-1 pi(i) = parent of node i obtain P(x1,...,xn|T,t) by summing probability of specific instantiation of tree (i.e., assignmennt of residues to internal node) under given evolutionary model over all possible assignments of residues to non-leaf nodes P(x1,...,xn|T,t) = sum_{assignment of non-leaf nodes n+1...2n-1} q_{root residue} * product_{non-leaf nodes i} P(a_i | a_pi(i),t_i) * product_{leaf nodes i} P(x_i | a_pi(i),t_i) Note: this works because given an instantiation of all internal node residues, evolution along each each is independent of evolution along all other edges To compute this efficiently, use post-order traversal of tree (bottom to top, starting from leaves) to obtain P(L_k|a) = probability of all leaves beneath node k given node k is assigned a Felsenstein's Algorithm: 1. k=2n-1 2. recursively compute P(L_k|a) for all a as follows if k is a leaf node then P(L_k|a) := 1 if a=x_k, 0 otherwise else compute P(L_i|a), P(L_j|a) for all a at daughter nodes i,j of k P(L_k|a) := sum_{b,c} P(b|a,t_i) * P(L_i|b) * P(c|a,t_j) * P(L_j|c) 3. P(x|T,t) := sum_a (q_a * P(L_{2n-1}|a)) Note: P(b|a,t_i) and P(c|a,t_j) are calculated according to the given model of evolution again, with N sites, take product over each individual site (assumption: all sites evolve independently) [example, see Durbin et al., p.192f, p.201f] -- How to search over trees in order to find overall optimum: - use same methods as for large parsimony problem (see above) but: have to additionally find optimal edge lengths for given topology for a given tree, this can be done using numerical optimisation techniques (newton's method, conjugate gradient methods, expectation-maximisation) overall maximum likelihood procedure in this case: - search over tree topologies - for each candidate tology, find edge lengths that maximise likelihood (this will typically need a subroutine for determining the likelihood of the given taxa given the tree topology and edge lengths -> may use felsenstein's algorithm) - alternative: use probabilistic sampling methods, e.g., Metropolis algorithm (see Durbin et al., p.207ff) fundamental idea (can be seen as special case of SLS): start with abritrary tree T,t iterate: given a tree T,t probabilistically generate a new candidate tree T',t' P := P(T,t|x); P' := P(T',t'|x); if P' > P then T,t := T',t' else T,t := T',t' with probability P'/P T' is generated from T according to randomised local transformation of edge lengths and topology calculation of P(T,t|x) is done via Bayes rule from P(x|T,t), prior P(T,t) - this can be, e.g., uniform over all trees (don't need P(x), since this is just a normalisation factor) -- the max-likelihood approach can be generalised to more realistic models of evolution, specifically, site dependent mutation rates and insertions/deletions, but this is beyond the scope of this course. (But: see Durbin et al., Section 8.5) --- Resources: Durbin et al., Ch.8, Sections 8.1-8.3, beginning of 8.4 Additional material: @incollection{ cottamoscato02:phylotrees:ppsn, author = "Cotta, C. and Moscato, P.", title = "Inferring Phylogenetic Trees Using Evolutionary Algorithms", booktitle = "Parallel Problem Solving From Nature VII", editor = "Merelo, J.J. and Adamidis, P. and Beyer, H.-G. and Fern\'andez-Villaca\~nas, J.-L. and Schwefel, H.-P.", series = "Lecture Notes in Computer Science", volume = "2439", pages = "720--729", publisher = "Springer-Verlag", address = "Berlin", year = "2002", url = "citeseer.nj.nec.com/cotta02inferring.html" } @misc{ salter-algorithms, author = "L. A. Salter", title = "Algorithms for Phylogenetic Tree Reconstruction", url = "citeseer.nj.nec.com/salter00algorithms.html" } http://biohpc.learn.in.th/contents2002/Worawit/worawit.pdf - nice slide set, but contains some errors http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html ---